Search at Uber over Apache Lucene


Sia (Egyptian God of Perception) 
or 
Search In Action 


Yupeng Fu, Uber 


About Me


• Yupeng Fu
• Search Platform (Sia)
• Real-time Data Platform
• yupeng9@github
• Apache Pinot Committer/PMC
• Alluxio PMC


Vision 


A scalable, high-performance platform 
supporting flexible ranking 
for Uber’s search and discovery needs 
Eats


Feed and Search are the two main entry points for consumers to discover Eats inventory,
including restaurants, dishes, grocery stores, etc.

[Screenshot: Eats feed with restaurant and dish cards showing prices, delivery fees, and delivery time estimates]


Eats Feed: The Choice Problem


There are 1M+ restaurants on Eats. The average Eater has 1000+
restaurants/dishes to choose from when they open the app or
website.


Variable intent 
They don't necessarily know what they want 




Complex decision making 
Selection (cuisines, restaurants, dishes), price, speed, reliability,
dietary restrictions, party size and preferences.


Discovery is about helping Eaters decide what they want to eat 
or buy. 




Pindrop: Build a magical pickup/dropoff experience for everyone

Enroute flow


How do we determine the pickup location? 


Two Steps:

1. Where is the Rider? (Request Location)
2. Where should the rider get picked up? (Rendezvous Location)

We serve 500+ locations per second across Rides and Eats


[Screenshot: "Confirm your pickup spot" screen]


Search powered pickups 


Deterministic & limited set of locations

Features can be computed offline globally

Sia supports ID-based lookups

Search for pre-generated Pickup Locations

Rank based on request parameters and user preferences
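
The lookup-then-rank flow above can be sketched as follows. This is only an illustration, not Sia's API: the `PickupSpot` type, the `SPOTS` table, the `offline_score` feature, and the scoring weights are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class PickupSpot:
    spot_id: str
    lat: float
    lng: float
    offline_score: float  # hypothetical feature computed offline


# Hypothetical pre-generated pickup spots, keyed by ID for direct lookup.
SPOTS = {
    "p1": PickupSpot("p1", 37.7750, -122.4190, 0.9),
    "p2": PickupSpot("p2", 37.7755, -122.4180, 0.6),
    "p3": PickupSpot("p3", 37.7800, -122.4100, 0.8),
}


def lookup(spot_ids):
    """ID-based lookup: return the known spots for the given IDs."""
    return [SPOTS[i] for i in spot_ids if i in SPOTS]


def rank(spots, rider_lat, rider_lng, preferred_ids=frozenset()):
    """Order candidates by offline score, user preference, and distance."""
    def score(spot):
        distance = hypot(spot.lat - rider_lat, spot.lng - rider_lng)
        bonus = 0.5 if spot.spot_id in preferred_ids else 0.0
        return spot.offline_score + bonus - 100.0 * distance
    return sorted(spots, key=score, reverse=True)
```

Because the candidate set is small and known ahead of time, the expensive features can live in the index and the request only contributes its location and the user's preferences.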


Unified Build 


Indigo: Real-time geospatial search for Ride’s match 
Requests


Geosearch: Uber's in-house location search engine

Geocoding
Autocomplete & Full Text Search
Reverse Geocoding
Predictions
Personalization
Category Search

• Multiple billions of queries globally per month


Challenges


• Large volume of serving traffic: tens of thousands of QPS
• Geospatial data heavy
  o Discoverability
  o Deliverability
• Frequent updates
  o Driver/Rider location
  o Store open/close
  o Availability of dishes/items
• Complex ranking methods
  o Fulfillment match
  o Semantic search
• Integrate with Uber infra components


Evolution from ElasticSearch® over Lucene®


• Updates to index are costly:
  o Pushed at document granularity
  o Includes database-like consistency guarantees, replication logs
  o Search nodes have to double as indexing nodes
• Very hard to rebuild index
• Schema changes difficult
• Obsolete data in index difficult to delete
• Document order in index cannot be specified
• Query and ranking functionality not tuned for Uber’s needs
• Difficult to integrate with Uber components
Sia Architecture

[Diagram labels: Results; Updates; Source of Truth (Online + Offline)]


Sia Overview 


• Powered by Lucene
• Inspired by LinkedIn’s Galene
• Ingestion
  o Offline index building
  o Streaming index updates
• Serving
  o Aggregator
• Geo-sharding
  o Latitude-based
  o Hexagon-based
• Static ranking and early termination
• Integrates with Uber components
  o gRPC, Muttley
  o Rate limiting, circuit breaker
  o Apache Kafka®, ELK®, M3
  o Terrablob (ObjectStore), Hadoop
  o Schemaless, DocStore
  o Security/Compliance
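
A minimal sketch of how static ranking and early termination can work together, assuming documents are stored in descending static-rank order so that the first k matches are also the k best by static rank. The function name and the `matches` callback are hypothetical, not Sia's API.

```python
def top_k_with_early_termination(postings, matches, k):
    """Scan doc IDs in index order and stop as soon as k matches are found.

    Assumes `postings` yields doc IDs in descending static-rank order.
    Returns the matching doc IDs and how many documents were scanned.
    """
    results = []
    scanned = 0
    for doc_id in postings:
        scanned += 1
        if matches(doc_id):
            results.append(doc_id)
            if len(results) == k:
                break  # early termination: the rest can only rank lower
    return results, scanned
```

With 1,000 candidate documents and a filter that accepts every second one, asking for three results scans only five documents.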
Sia Overview - Indexing


• Lambda Architecture for indexing
• Three-layer index
  o Base Index: Immutable index of Lucene segments built periodically
  o Snapshot Index: Continuous snapshots of Lucene segments for node
    recovery, before eventually getting merged into the base
  o Live Index
    ▪ Supports field-level updates
    ▪ In-memory implementation of Lucene data structures with lock-free
      updates
    ▪ Updates are immediately available for querying (vs. waiting for
      “commit” on ES)


The Base Index 




Normal Lucene index 

Mapped to memory for serving 

Offline index builds - no impact on search performance 
Clean reset 

Schema changes, versioning become easy 

Index building can be parallelized via mini shards 
Index merge from mini shards 

Can control order of documents in index 
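
The mini-shard build and merge can be sketched as below. This is an illustration under assumptions: documents are plain dicts, and the controlled document order is taken to be descending `static_rank`, a hypothetical field name. Real builds operate on Lucene segments rather than Python lists.

```python
import heapq


def build_mini_shard(docs):
    """Build one mini shard: documents sorted by descending static rank."""
    return sorted(docs, key=lambda doc: -doc["static_rank"])


def merge_mini_shards(mini_shards):
    """Merge already-sorted mini shards into one index with the same order."""
    merged = heapq.merge(*mini_shards, key=lambda doc: -doc["static_rank"])
    return list(merged)
```

Each mini shard can be built independently (and therefore in parallel); the merge is a streaming k-way merge, so the final order is preserved without re-sorting everything.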


The Live Index 


Sia Stream 




In-memory data structure 

Ingestion from Kafka 

Lock-free updates 

Updates at finest level of granularity 

Updates have almost no impact on query latencies 

Layered on top of Base Index (overrides Base Index) 

Tombstoning to support deletes 

Preserves document ordering in Base Index 

Supports addition of new documents 

Off-heap SkipList for better memory management and quick access 


The Snapshot Index 




Normal Lucene index 

Periodically persisted from Live Index 

More compact form, produced by merging the Live Index

Layered on top of Base Index (overrides Base Index) 
Layered below Live Index (overridden by Live Index) 
New documents carried over from Live Index 
Tombstones carried over from Live Index 

Preserves document ordering in Base Index 
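
The layering of the three indexes can be sketched with plain dictionaries. This is a simplified model, not Sia's implementation: the real layers are Lucene segments and lock-free in-memory structures, and the class and method names here are hypothetical.

```python
TOMBSTONE = object()  # marker for a deleted document


class LayeredIndex:
    def __init__(self, base):
        self.base = dict(base)  # built offline, never mutated here
        self.snapshot = {}      # periodically persisted from the live layer
        self.live = {}          # receives streaming updates

    def update(self, doc_id, fields):
        """Field-level update: store only the changed fields in the live layer."""
        pending = self.live.get(doc_id)
        if not isinstance(pending, dict):  # absent or tombstoned in live
            pending = {}
        self.live[doc_id] = {**pending, **fields}

    def delete(self, doc_id):
        """Deletes are recorded as tombstones; the base is left untouched."""
        self.live[doc_id] = TOMBSTONE

    def take_snapshot(self):
        """Fold the live layer into the snapshot layer.

        New documents and tombstones are carried over, then live is cleared.
        """
        for doc_id, entry in self.live.items():
            previous = self.snapshot.get(doc_id)
            if entry is TOMBSTONE or not isinstance(previous, dict):
                self.snapshot[doc_id] = entry
            else:
                self.snapshot[doc_id] = {**previous, **entry}
        self.live.clear()

    def get(self, doc_id):
        """Resolve a document: base, overridden by snapshot, then by live."""
        doc = dict(self.base.get(doc_id, {}))
        found = doc_id in self.base
        for layer in (self.snapshot, self.live):
            if doc_id in layer:
                entry = layer[doc_id]
                if entry is TOMBSTONE:
                    doc, found = {}, False
                else:
                    doc.update(entry)
                    found = True
        return doc if found else None
```

Reads always consult the layers from oldest to newest, so an update or tombstone is visible as soon as it lands in the live layer, and taking a snapshot does not change what a reader sees.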
Basic Query Infrastructure


• Lucene queries are implemented recursively
• Only term queries (the leaves in query expressions) operate directly on the index
• So we only need to change the term query implementation

So: st:orderable

Becomes: st:orderable (base) OR st:orderable (snapshot) OR st:orderable (live)

with consideration for tombstones and optimizations

• Therefore all Lucene and ElasticSearch queries remain supported
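
The term-query rewrite can be sketched as follows. This is a toy model under assumptions: each layer holds a term-to-doc-ID map plus the set of doc IDs it overrides (updated or tombstoned), and queries are nested tuples. None of these names come from Lucene or Sia.

```python
class Layer:
    def __init__(self, postings, overrides=()):
        self.postings = {term: set(ids) for term, ids in postings.items()}
        self.overrides = set(overrides)  # docs updated or tombstoned in this layer


def term_query(term, base, snapshot, live):
    """Union of the term's postings across layers, newest layer first.

    A document overridden by a newer layer is masked in all older layers.
    """
    hits, masked = set(), set()
    for layer in (live, snapshot, base):
        hits |= layer.postings.get(term, set()) - masked
        masked |= layer.overrides
    return hits


def evaluate(query, layers):
    """Recursive evaluation: only the "term" leaves touch the index."""
    op = query[0]
    if op == "term":
        return term_query(query[1], *layers)
    children = [evaluate(child, layers) for child in query[1:]]
    if op == "and":
        return set.intersection(*children)
    if op == "or":
        return set.union(*children)
    raise ValueError(f"unknown operator: {op}")
```

Since the boolean operators only combine the results of their children, changing how the leaves read the index is enough for composite queries to see the layered view.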


Serving Stack 


Aggregator + sharded search nodes 

gRPC API - same API at aggregator and search nodes 

Query understanding and query rewriting at aggregator 

Aggregator forwards rewritten query to search nodes 

Search nodes perform retrieval and at least one ranking pass 

Aggregator combines results from search nodes 

Aggregator can be used as a library - with in-process gRPC call (for Java only) 
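
A minimal scatter-gather sketch of this serving stack, with in-process calls standing in for gRPC. The `SearchNode` and `Aggregator` classes, the lowercase rewrite, and the precomputed scores are assumptions made for illustration only.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class SearchNode:
    """One shard: retrieves matching docs and does a first ranking pass."""

    def __init__(self, docs):
        self.docs = docs  # list of (doc_id, text, score)

    def search(self, term, k):
        hits = [(score, doc_id) for doc_id, text, score in self.docs
                if term in text.split()]
        return heapq.nlargest(k, hits)


class Aggregator:
    """Rewrites the query, fans out to every shard, and merges the results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def rewrite(self, query):
        # Stand-in for query understanding and rewriting.
        return query.strip().lower()

    def search(self, query, k):
        term = self.rewrite(query)
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(pool.map(lambda node: node.search(term, k), self.nodes))
        merged = heapq.nlargest(k, (hit for part in partials for hit in part))
        return [doc_id for _, doc_id in merged]
```

Each node returns at most k hits, so the aggregator merges at most k times the number of shards candidates regardless of index size.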


[Diagram labels: Aggregator; Sia Cluster 1; Server (Shard 0); Server (Shard 15)]
Geo-Sharding


• Latitude-based sharding
• Divide the world map into narrow stripes
• Buffer degree for over-indexing the docs in the search radius
• Cover all time zones for serving
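
Latitude stripes with a buffer can be sketched as below. The equal-height stripes, the function names, and the parameter values are assumptions for illustration; the slides do not give the actual stripe layout.

```python
def stripe_bounds(num_shards):
    """Split latitudes [-90, 90] into equal-height stripes (south, north)."""
    height = 180.0 / num_shards
    return [(-90.0 + i * height, -90.0 + (i + 1) * height)
            for i in range(num_shards)]


def shards_for_document(lat, num_shards, buffer_deg):
    """Shards that index a document: its home stripe plus any stripe whose
    edge lies within `buffer_deg` of the document (over-indexing)."""
    return [shard
            for shard, (south, north) in enumerate(stripe_bounds(num_shards))
            if south - buffer_deg <= lat < north + buffer_deg]


def shard_for_query(lat, num_shards):
    """A query is routed only to the stripe that contains its latitude."""
    height = 180.0 / num_shards
    return min(int((lat + 90.0) // height), num_shards - 1)
```

As long as the search radius stays within the buffer, a query near a stripe boundary still finds nearby documents on its own shard, because those documents were indexed on both sides. One degree of latitude is roughly 111 km, which gives a sense of scale for choosing the buffer.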


[Map labels: Shard 1; Shard 2]


Geo-Sharding 


• Latitude-based sharding challenges
  o Shard size skewness
  o Unnecessary scanning in retrieval


Overlay of European Cities onto North America 


https://brilliantmaps.com/cities-transposed-latitude/ 


Geo-Sharding 


Hexagon-based sharding 

Tile the world map into hexagons 

Use higher-resolution hexagons for buffer 
Evenly sized shards via bin-packing solution 
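
One way to get evenly sized shards is a greedy bin-packing heuristic: take hexagons from largest to smallest and place each on the currently lightest shard. The slides do not say which bin-packing solution Sia uses, so this is a stand-in; it also ignores the buffer hexagons and any preference for keeping neighbouring hexagons together. The hexagon IDs and document counts are hypothetical.

```python
import heapq


def pack_hexagons(hex_doc_counts, num_shards):
    """Assign hexagons to shards, largest first, always to the lightest shard.

    Returns (assignment, loads): hexagon ID -> shard, and docs per shard.
    """
    heap = [(0, shard) for shard in range(num_shards)]  # (load, shard)
    assignment = {}
    ordered = sorted(hex_doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for hex_id, count in ordered:
        load, shard = heapq.heappop(heap)
        assignment[hex_id] = shard
        heapq.heappush(heap, (load + count, shard))
    loads = [0] * num_shards
    for hex_id, shard in assignment.items():
        loads[shard] += hex_doc_counts[hex_id]
    return assignment, loads
```

Unlike latitude stripes, the unit of assignment here is small and independent of the map's shape, so dense cities can be spread across shards instead of landing in one stripe.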


Time travel to earlier index version 


Index corruption is hard to mitigate
• Rolling back to a previous index took hours / days
• Building a new index takes hours / days

[Diagram labels: Sia Cluster (Previous Version); Sia Cluster (Current Version); Snapshot to release memory; Indexes + Snapshots (HDFS / S3); 8+ hours to catch up]


Time travel: keep all versions fresh with snapshot cluster 


[Diagram labels: Query Aggregator; Sia Cluster (Current Version), Previous Version (Disk); Continuous Ingestion; Load Snapshot; Save Snapshot; Snapshot Cluster (Current Version) (Previous Version); Indexes + Snapshots (HDFS / S3)]

Time travel: Mitigation flow 


[Diagram labels: Query Aggregator; Time Travel (20min); Sia Cluster (Current Version), Previous Version (Disk); Continuous Ingestion; Load Snapshot; Save Snapshot; Snapshot Cluster (Current Version) (Previous Version); Indexes + Snapshots (HDFS / S3)]


Time travel: Mitigation flow 


[Diagram labels: Query Aggregator; Time Travel (20min); Sia Cluster (Previous Version), Current Version (Disk); Continuous Ingestion; Load Snapshot; Save Snapshot; Snapshot Cluster (Current Version) (Previous Version); Indexes + Snapshots (HDFS / S3)]


Semantic Search


Next-gen search architecture 
Incorporating semantic signals like context 
Deep integration with ML 
Better search quality 
[Screenshot: location search in New York with "Downtown conrad" in the search box. Listed results: Conrad New York (downtown), 102 N End Ave, New York City; Conrad New York (midtown), 151 W 54th St, New York City; Conrad, 102 N End Ave, New York, NY; Conrads Famous Bakery, 856 Utica Ave, Brooklyn NY; Conrad's Pharmacy, 333 Long Beach Rd, Island Park, NY]


KNN in Sia 


Built on top of Lucene’s KNN 
Geo-sharded HNSW graphs at Uber scale 
Pre-filter with location conditions 

Filter Cache for latency improvement 
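
Pre-filtering and the filter cache can be illustrated with a brute-force nearest-neighbour search. This deliberately swaps the HNSW graph for an exhaustive scan to keep the example short, so it shows only the filter-then-search order and the caching of filter results. The `region` field, the class name, and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class PrefilteredKnn:
    def __init__(self, docs):
        self.docs = docs            # dicts with "id", "region", "vec"
        self._filter_cache = {}     # region -> frozenset of allowed doc IDs
        self.filter_evaluations = 0

    def _allowed_ids(self, region):
        """Evaluate the location filter once per region, then reuse it."""
        if region not in self._filter_cache:
            self.filter_evaluations += 1
            self._filter_cache[region] = frozenset(
                d["id"] for d in self.docs if d["region"] == region)
        return self._filter_cache[region]

    def search(self, query_vec, region, k):
        """Apply the location filter first, then rank survivors by similarity."""
        allowed = self._allowed_ids(region)
        candidates = [d for d in self.docs if d["id"] in allowed]
        candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
        return [d["id"] for d in candidates[:k]]
```

Filtering before the vector search guarantees that all k results satisfy the location condition, and repeated queries for the same region skip the filter evaluation entirely.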


[Diagram labels: Train; Ingest; Index Build; HNSW base; Remote Index Storage; HNSW snapshot; Calculate; Results; Live updates; Sia Leaf]




Q&A 
ElasticSearch® is a registered trademark of Elastic Company.


